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Markus ThorrQand Gunther Palm@ 

Abstract. We propose a linear time and constant space algorithm for computing 
Euclidean projections onto sets on which a normalized sparseness measure attains 
a constant value. These non-convex target sets can be characterized as intersections 
of a simplex and a hypersphere. Some previous methods required the vector to be 
projected to be sorted, resulting in at least quasilinear time complexity and linear 
space complexity. We improve on this by adaptation of a linear time algorithm 
for projecting onto simplexes. In conclusion, we propose an efficient algorithm for 
computing the product of the gradient of the projection with an arbitrary vector. 

1 Introduction 

In a great variety of classical machine learning problems, sparse solutions are appealing because 
they provide more efficient representations compared to non-sparse solutions. Several formal 
sparseness measures have been proposed in the past and their properties have been thoroughly 
analyzed Q. One remarkable sparseness measure is the normalized ratio of the L\ norm and the 
Li norm of a vector, as originally proposed by ||2] : 

Mi 

o:R»\{0}->[0, 1], x^ . 

Here, higher values of a indicate more sparse vectors. The extreme values of and 1 are 
achieved for vectors where all entries are equal and vectors where all but one entry vanish, re- 
spectively. Further, a is scale-invariant, that is a(ajc) = o(x) for all a ^ and all x G M" \ {0}. 

The incorporation of explicit sparseness constraints to existing optimization problems while 
still being able to efficiently compute solutions to them was made possible by [2 ] through propo- 
sition of an operator, which computes the Euclidean projection onto sets on which o attains a 
desired value. In other words, given a target degree of sparseness a* € (0, 1) with respect to CT, 
numbers A,i,A<2 > can be derived such that a = G* on the non-convex set 

D '■= { s G R> o | ||j|| i = X\ and \\s\\ 2 = } • 

Clearly, either of X\ and A2 has to be fixed to a pre-defined value, for example by setting A2 := 1 
for achieving normalized vectors, as only their ratio is important in the definition of 0. By 
restricting possible solutions to certain optimization problems to lie in D, projected gradient 
descent methods Q can be used to achieve solutions that fulfill explicit sparseness constraints. 
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The projection operator of [2] was motivated by geometric ideas, in such that the intersection 
of hyperplanes, hyperspheres and the non-negative orthant were considered and a procedure of 
alternating projections was proposed. This procedure is known to produce correct projections 
when applied to the intersection of convex sets 0. In the non-convex setup considered here, it 
is not clear in the first place whether the method also computes correct projections. The results 
of Q, however, show that alternating projections also work for the sparseness projection, and 
that the projection onto D is unique almost everywhere. 

It was further noted recently that the method of Lagrange multipliers can also be used to derive 
an implicit, compact representation of the sparseness projection (6]|. The algorithm proposed 
there needs to sort the vector that is to be projected and remember the sorting permutation, 
resulting in a computational complexity that is quasilinear and a space complexity that is linear 
in the problem dimensionality n. In this work, we provide a detailed derivation of the results 
of (6l. Then, by transferring the ideas of [7 1 to efficiently compute projections onto simplexes 
we use the implicit representation to propose a linear time and constant space algorithm for 
computing projections onto D. Ultimately, we propose an algorithm that efficiently computes 
the product of the gradient of the projection onto D with an arbitrary vector. 

2 Notation and Prerequisites 

We denote the set of Boolean values with B, the real numbers with R and the ^-dimensional 
Euclidean space with R". Subscripts for elements from R" denote individual coordinates. All 
entries of the vector e G R" are unity. W xn is the ring of matrices with n rows and n columns, 
and E n G R" x " is the identity matrix. It is well-known that the L\ norm and the L 2 norm are 
equivalent in the topological sense 00 : 

Remark 1. For all x G R", we have that ||x|| 2 < ||x|| 1 < v^lklla- ^ x * s sparsely populated, 
then the latter inequality can be sharpened to [|jc||j < \fd ||x|| 2 , where d := ||je|| < n denotes the 
number of non- vanishing entries in x. 

Therefore X 2 < hi < \fn\2 must hold for the target norms to achieve a sparseness of o* G (0, 1). 
The projection onto a set contains all points with infimal distance to the projected vector |@]: 

Definition 2. Let iel"and0/MCl". Then every point in 

proj M (x) := {y GM | ||y-Jt|| 2 < ||z-*|| 2 for all z G M} 

is called Euclidean projection of x onto M. If there is exactly one point y in proj M (x), then 
y = proj M (jc) is written for abbreviation. 

We further note that projections onto permutation-invariant sets are order-preserving: 

Proposition 3. Let such that P x x G M for all x G M and all permutation matrices 

P z G R" x ". Let x G R" and p G pro] M (x). Then Xi > xj implies p\ > pj for all i,j G {1, . .. ,n}. 

Throughout the paper, we assume that the input vector to the projection operator is chosen such 
that the projection onto D is unique. As has been shown by Q, this is fulfilled by almost all 
x G R" \ { } and is thus no restriction in practice. 
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3 Implicit Representation of the Projection 



As noted by O, the method of Lagrange multipliers can be used to derive an implicit represen- 
tation of the projection onto D. To make this paper as self-contained as possible, we include an 
elaborate derivation of their result. 

Lemma 4. Let x G R> \ D such that proj^ (x) is unique. Then there exist unique numbers a£R 
and P € R>o such that proj^ (x) = max (i (jc — a • e) , 0) . 

Proof. We want to find a point p E D such that the Euclidean distance \\p— x|| 2 is minimized. 
Such a point is guaranteed to exist by the WeierstraB extreme value theorem. The constrained 
optimization problem leads to the Lagrangian £:R"xlxRx R" — > R, 

(p, a, p, y )^^\\p- x \\l + a (}\p\\ 1 -k 1 ) + ^[\\p\\l-^)-fp, 

where the multiplier p was linearly transformed for notational convenience. By taking the deriva- 
tive for p and setting it to zero we obtain 

= fipi — x, : + oc — y = 0, and hence pi = — — for all i E {1, . . . ,n\. 
dpi [3 

The complementary slackness condition, y/?,- = for all i E {1, . . . ,«}, must be satisfied in a 
local minimum of L . Hence pi > implies y = and p\ = implies y ■ > for all i E { 1 , . . . , n}. 
Let /:= {i E {1,. . . ,n} \ pi > 0} denote the set of coordinates in which p does not vanish, and 
let d := |/| denote its cardinality. We have d > 2, because d = is impossible due to X\,%2 > 
and d = 1 is impossible because X\ ^ A,2- Further, y = for all i E / from the complementary 
slackness condition. Let x E R> be the vector with all entries from x with index in /, that is 
when /={«!,..., i c /} then x T = (x,-, , . . . , x !rf ). Note that because all entries of p and x are non- 
negative, the sum over their entries is identical to their L\ norm. By taking the derivative of the 
Lagrangian for a and setting it to zero we have that 

^i = Iblli =EPi = Ep(*i-a) = p(H*Hi _da )- 

r'G/ 167 

Analogously, taking the derivative for [3 and setting it to zero yields 

^2 = IHI2 = pl^( x ' _oc ) 2 = ■p(Pll2 _2a Plli +da 2 ). 

iel 

By squaring the expression for X\ and dividing by A| we get 

^i _ Plli -2doc||xi|| +d 2 v? 
^2 Pll2 — 2ot || j + da 2 
which leads to the quadratic equation 

°=^-!)-« 2 + 2 piii(! 

— a =:b 



■ -d -oc + 
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Before considering the discriminant of this equation, we first note that d ||x|| 2 — ||x|| j > with 

RemarkQ] As p exists by the WeierstraB extreme value theorem and has by definition d non-zero 

x 2 

entries, we also have that d — -A > using RemarkQ] Thus we obtain 

D :=b 2 - 4ac = 4 \\x\\l (d - 2 - Ad (d - (\\x\\l - £| ||x| 
= 4§ (d-H) (d\\x\\?- ||x||f) >0, 



so a must be a real number. Solving the equation leads to two possible values for a: 

-b±y/D 



a £ 



2a 




We first assume that a is the number that arises from the "+" before the square root. From 
^i = Hp 1 1 1 we then obtain 



' d 1 1 JC 1 1 2 



p= iMl - rf „) = -y^f-^<o. 



plementary slackness condition then yield pi — pj = h (x* — (X — Xj + a) = ^ (x, — Xj) < 0, which 
contradicts the order-preservation as guaranteed by Proposition [3] Therefore, the choice of a 



With d > 2 there are two indices i, j £ / with x, > xj. The derivative of L for p and the com- 
plementary slackness condition then yield 
contradicts the order-preservation as guar; 
was not correct in the first place, and thus 

_ ||~||2\ 

andp 





dh.2 — A< j 

must hold. Let i € /, then < pi = p (x, — a), and because P > follows x; > a. For i / 
it is = pi = h (x; — oc + Yi) where y > 0, so = x; — a + y > x, — a, or equivalently x, < a. 
Ultimately, we have that pi = max (A (x,- — a) , 0) for all i 6 {1, ... ,n}. 

For the claim to hold, it now remains to be shown that a and p are unique. With the uniqueness 
of the projection p, we thus have to show that from 

p = max (jjj- (x — oti-e), 0) = max (x— CC2 ■ e) , 0) 

for 0C1 , 0C2, Pi , P2 £R follows that 0C1 = 0C2 and Pi = P2. As shown earlier, there are two distinct 
indices i,j G / with x, ^ xj and Pi,Pj > 0. We hence obtain 

Pi = FT ( x ' ~ a i) = (*i - «i) and = £ (x/-0Ci) = ^ (x 7 - - oc 2 ) , 

and thus & = = f^. Therefore, 

JJ j JCj \X [ JCj *Jv2 

= (x,- — cxi)(x 7 - — 012) — (x, — a<2){xj — cci) = ai(x, — x ; -) — a2(x,- — x 7 -) = (oil — a2)(x ( - — xj). 
With x, ^ x y - we have that (Xi = (X2, and substitution in either of pj or pj shows that Pi = p2- n 
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We note that the crucial point in the computation of a is rinding the set / where the projection 
has positive coordinates. With the statement of Proposition [3]the argument of the projection can 
be sorted before-hand such that / = {1 , . . . , d} and therefore only a number linear in n of feasible 
index sets has to be checked. This is essentially the method proposed by 0. The drawback of 
this approach is that the time complexity is quasilinear in n because of the sorting, and the space 
complexity is linear in n because the permutation has to be remembered to be undone afterwards. 



4 Finding the Zero of the Auxiliary Function 

Lemma |4] gives a compact expression that characterizes projections onto D. We first note that 
the representation only depends on one number: 

Remark 5. Let x G R" \D such that proj D (x) is unique. Then there is exactly one a£K such 

that nrni M — ^-max^-q-e, 0) 

tnatpro] D (x) - || max(v _ a . ei oJikT" 

Proof. The projection becomes proj Z) (x) = max (i (x — a - e) , 0) with unique numbers agM 
and P G R>o due to Lemma 0] With p > we have that X 2 = || max (x— a- e) , 0) || 2 = 
jj 1 1 max (x — a-e, 0) || 2 , and the claim follows. n 

It can hence be concluded that the sparseness projection can be considered a soft variant of 
thresholding (9]: 

Definition 6. The function S a '■ R — > R, x \— > max(x — a, 0), is called soft-shrinkage function, 
where a G R. It is continuous on R and differentiable exactly on R \ { a }. 

With Remark [5] we know that we only have to find one scalar to compute projections onto D. 
Analogous to the projection onto a simplex Q, we can thus define an auxiliary function which 
vanishes exactly at the number that yields the projection: 

Definition 7. Let x G R" \D such that proj D (x) is unique and o(x) < o*. Let the maximum 
entry of x be denoted by x max := max, e n r .. )n }X/. Then the function 

*:[0,x max )^R, a ^f-^^ -T 

||max(x-a-<?, 0)|| 2 A 2 

is called auxiliary function to the projection onto D. 

Note that the case of a(x) > a* is trivial, because in this sparseness-decreasing setup we have 
that all coordinates of the projection must be positive. Hence / = {l,...,n} in the proof of 
Lemma HI and the shifting scalar a can be computed from a closed-form expression. 
We further fix some notation for convenience: 

Definition 8. Let x G R> be a vector. Then we write X := {x, | i G {1, . . . ,n} } for the set of 
entries of x. Further, let x m i n := min X be short for the smallest entry of x, and x max := maxX 
and X2nd-max := maxX \ {x max } denote the two largest entries of x. 
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Offset of Soft-Shrinkage a 

Figure 1: Plot of the auxiliary function *P and its derivative for a random vector x (see Lemma|9] 
for an analysis). The derivative *F' was scaled using a positive number for improved 
visibility. The steps in *F' are exactly the places where a coincides with an entry of x. 
With Remark[TT] it is sufficient to find an a such that *F(jc/) > and *F(x*) < for the 
neighboring entries Xj and Xk in x, because then the exact solution a* can be computed 
with a closed-form expression. 



Let q : R — >• R n , a h-> max (x — a-e, 0), denote the curve that evolves from entry-wise appli- 
cation of the soft-shrinkage function. Let the Manhattan norm and Euclidean norm of points 
from q be given by l\ : R ->• R, a i-> ||?(a)|| v and £ 2 '■ R ->■ R, oc (->■ ||?(a)|| 2 , respectively. We 
thus have that \F = ^ — ^, and we find that g(oc) / if and only if a < x max . 

In order to efficiently find the zeros of *P we first investigate its analytical properties. See Fig- 
ure [T]for an example of *P that provides orientation for the next result. 

Lemma 9. Let x G R> \D be given such that the auxiliary function is well-defined. Then: 

(a) is continuous on [0, x max ). 

(b) *P is differentiable on [0, x max ) \X. 

(c) *P is strictly decreasing on [0, X2 n d-max) and constant on [x2 n d- m ax, x max ). 

(d) There is exactly one a* G (0, X2 n d-max) with *P(a*) = 0. 

(e) proj D (x) = ^ m ™^ a C, e o)t_ Where a * is the zer ° ° f ^ 

Proof, [(a)] ^ is continuous because the soft-shrinkage function is continuous. Hence so are t\ 
and £2, and hence *P as compositions of continuous functions. 



(b)]The soft-shrinkage function causes *P to be differentiable exactly on [0, x max ) \X. Now 



let xj < Xk be two successive elements from X, such that there is no element from X be- 
tween them. In the case that x* = x m j n it can be assumed that xj = 0. Then the index set 
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I:={ie{l,...,n} | %i > a} of non- vanishing coordinates in q is constant for a G (xy, Xfc), and 
the derivative of *F can be computed using a closed-form expression. For this, let d := |/| de- 
note the number of nonzero entries in q. With 4(a) = (x,- — a) = Y*iei x i ~ ^ a we obtain 
i\(a) = -d. Analogously, it is ^^(a) 2 = ^ £ fe7 (x,- - a) 2 = -2£,- G/ (x;-a) = -24(a), and 
hence £' 2 {a) = ^ y/l 2 (a) 2 = ^(a) -1 ^ 2 (a) 2 = -§M- Therefore, the quotient rule yields 

-J4(a)+4(a)§{§ i ^ l(a)2 
*P (a) = — - ' 



4(a) 2 4(a) V4(a) 2 

It can further be shown that higher derivatives are of similar form. We have that ^4 (a) 2 
24 (a)£[ (a) = -2di x (a), and thus 

3 4(a) 2 -2d4(a)4(a) 2 + 24(a) 3 4(a) /4(a) 2 



da£ 2 (a) 2 4(a) 4 4(a) 2 \£ 2 (a) 

We also obtain = ^ = and eventually 



2(a) 3 V4(a) 2 ; 4(a) 4(a) 2 V^(a) 2 7 4(a) 3 \£ 2 (a 



iu a *"(«) i ^i(a) 

or m other words ^ = 3^. 



[(c)] First let a G (x2 n d-max, -^max)- With the notation of |(b)| we then have that d = I, such that q 
has exactly one non- vanishing coordinate. Hence, 4(a) — 4(a) and = on (x2nd-maxj -^max)) 
thus *F is constant on (x2 n d-max> -*max) as a consequence of the mean value theorem from real 
analysis. Because is continuous, it is constant even on [x2 n d- m ax> *max)- 

Next let a 6 [0, x 2n d- m ax) \X, then d > 2 and l\ (a) < Vd£ 2 (a) with RemarkQ] The inequality 
is in fact strict, because q(a) has at least two distinct nonzero entries. This implies that < 
on (xj, Xk) where xj < Xk are neighbors of a as in (b) The mean value theorem then guarantees 
that m is strictly decreasing between neighboring elements from X. This property holds then for 
the entire interval [0, X2 n d-max) due to the continuity of x ¥. 
|(d) | We have by requirement that c(x) < o*, and therefore jj^ 1 > and so *P(0) > 0. For 



a £ (x2nd-max, -^max) we obtain £\ (a) = 4(a) as in[(c)] It is then *P(a) < using X 2 <Xy. The 
existence of a* 6 [0, X2 n d-max) with ^(a*) = follows from the intermediate value theorem 
and |(c) | Uniqueness of a* is guaranteed because *P is strictly monotone. 



(e) With Remark [5] there is exactly one 6c € E such that proj^x) = j^™*^ -e 0)^ • ^ e see 



that x ¥(a) = 0, and the uniqueness of the zero of *P implies that a* =a. □ 

The unique zero of the auxiliary function can be found numerically using standard root-finding 
algorithms, such as Bisection, Newton's method or Halley's method ifTOl . We can improve on 
this by noting that whenever a number is found in a certain interval, then the exact value of the 
zero of *P can already be computed. 

Remark 10. Let x G M" \ D such that the auxiliary function *P is well-defined. Then there are 
two unique numbers xj < x^, where either xj = and Xk = x m ; n or xj,Xk £ X such that there 
is no other element from X in between, such that ^(xj) > and ^(x^) < and there is an 
a* € [xj, x k ) with»F(a*) =0. 
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Proof. Let a* G (0, .*2nd-max) with *F(a*) = be given with Lemma[9l When a* < x min holds, 
existence follows immediately with Lemma [9] by setting Xj := and x k := x min . Otherwise, 
define xj := max{x,- | x,- G X and x,- < a* } and x k := min{x, | x, € X and x,- > a* }, which both 
exist as the sets where the maximum and the minimum is taken are nonempty. Clearly these two 
numbers fulfill the condition from the claim by Lemma |9] The bracketing by xj and x k is unique 
because a* is in both cases unique with Lemma [9] □ 

We further note that it is easy to check whether the correct interval has already been found, and 
give a closed-form expression for the zero of *P in this case: 

Remark 11. Let x G M> \Z) such that the auxiliary function ^ is well-defined and let a G 

[0, x max ). If a < x m i n define xj := and x k := x m [ n , otherwise let xj < a < x k with xj,x k G X 
such that there is no element from X between xj and x k . Let / := {i G {1, . . . ,n} | x,- > a} = 
{h,...,id} where d := \I\ andxGM> such thatx r = (x (1 , . . . ,x !rf ). Then the following holds: 

(a) = ||x||j -d\ and = \\x\\\-2% \\x\\ x + d^ 2 for £, G {xj, a, x k }. 

(b) When X 2 £ l (x j ) > Xi£ 2 (xj) and X 2 £i(x k ) < X^ixk) hold, then 

is the unique zero of 

Proof, [(a)] We have that £i(a) = £iei( x i — cx) = Lie/*/ — da= {{x^—da and further ^(cc) 2 = 
L iG /(x,--oc) 2 = £ ; - e/ (x? - 2ocx,- + a 2 ) = 1 1 Jc 1 1 ^ — 2a||x||j + da 2 . 

Now let K := { i G {1, . . . ,n} | x,- > x k } and £ := { i G {1, . .. ,n} \xt = x k }. Then K = I\K, 
and thus £i(x*) = £ ; - eS -(x, -x k ) = £ ; - e7 (xi ~ x k) ~ LieK&i ~ x k) = Liei( x i- X k) = \\ x \\i ~dx k . 
Likewise follows l2(x k ) 2 = [|x|| 2 — 2x k \\x\\i +dx 2 . 

Finally, let J := { i G {1 , . . . ,n} | x,- > Xj } and /:= { i G {1 , . . . , n} | x,- = Xj }. Then I = J\f, 

and we obtain £\ (xj) = £ te /(x ! -x / ) = Ziei( x i- x j)+'Liej( x i- x j) =Y<iei{ x i ~ x j) = \\ x \\i~ dx j- 
The claim for £ 2 (xj) 2 follows analogously. 

|(b)| The condition from the claim is equivalent to ^(xj) > and *P(xyt) < 0. Hence with 
RemarkfTOl there is an a* G [xj, x k ) with *F(a* ) = 0. Let p := proj D (x) be the projection of x onto 
D and define / := {i G {1, . . . ,«} | p\ > 0}. Lemma | 9je) | implies / = { i G {1, . . . ,n} | x,- > a* }. 
Furthermore, it is 7 = { i G { 1, . . . , n} | x ( - > X/ } = { i G { 1, . . . , n} | x ( - > oc } = /. Thus we already 
had the correct set of non- vanishing coordinates of the projection in the first place, and the 
expression for a* follows from the proof of Lemma [4] n 

5 A Linear Time and Constant Space Projection Algorithm 

By exploiting the analytical properties of *F, simple methods are sufficient to locate the interval 
in which its zero resides. Because the interval has a positive length, simple Bisection is guaran- 
teed to find it in a constant number of steps (7). Empirically, we found that solvers that use the 



8 



Algorithm 1: Linear time and constant space evaluation of the auxiliary function 
Input: x e R> , A,i,A^,a 6 M with < < ^i < \/"^2 ana " < oc < max/gr^ n ix/. 
Output: ^{a)^'{a)y'{a),^{a)^'{a) e E, finished e B, £ u £j el,rfeN. 

// Initialize. 

1 l\ := 0; ^ := 0; d := 0; xj := 0; Ax 7 - := -a; x k := 00; Ax* := 00; 

// Scan through x. 

2 for i := 1 to n do 

3 t := — a; 

4 if f > then 
5 

6 
7 
8 
9 

10 end 

// Compute ¥(a), ¥'(a) and ¥"(a). 

11 ^i(oc) :=h-da; 4(a) 2 := ^ - 2a£i +da 2 ; 



t x ■=£ 1 +x i ; t\;=l\+x}\ d:=d+\; 
if t < Axk then x k := x,-; Ax,t := t; ; 



se 



end 



if t > Axj then x y - 



12 »F(a 



(?,(a) 2 
7 



// Compute ^(a) and 4" (a). 



</); ¥"(a) := 



3¥'(a)li 1 (a) , 



13 Woe) 



^(a) 2 »? 



<? 2 (a) 2 X\ 



3: n«) :=2 



^'(oc); 



// Compute *^(xy) and *F(xfc), check for sign change and return. 



14 finished := ^(li —dxj) > Xiy^ 2 — 2xji\ +dx 2 j and (l\ — dx^) < X[ \jt\ — 2x^1 +dx^; 
is return (¥(cc), »P'(a), ¥"(a), ❖(a), ❖'(a), finished, £ u t\, d); 



derivative of converge faster, despite of the step discontinuities of *F'. We have implemented 

Newton's method, Halley's method, and Newton's method applied to the slightly transformed 

f X 2 

auxiliary function *}/ := 4 — A. These methods were additionally safeguarded with Bisection 
to guarantee new positions are located within well-defined bounds fTTTl . This does impair the 
theoretical property that only a constant number of steps be required to find a solution, but in 
practice a significantly smaller number of steps needs to be made compared to plain Bisection. 
This is demonstrated through experimental results in Section |7J 

We are now in a position to formulate the main result of this paper, by proposing an efficient 
algorithm for computing sparseness-enforcing projections: 

Theorem 12. Algorithm |2] computes projections onto D, where unique, in a number of opera- 
tions linear in the problem dimensionality n and with only constant additional space. 

The proof is omitted as it is essentially a composition of the results from Section [4] 
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Algorithm 2: Linear time and constant space projection onto D. The auxiliary function *P is 
evaluated by calls to "auxiliary", which are carried out by Algorithm Q] 

Input: x G R" , X\ ,X 2 G R with < X 2 < Xi < ^/nk 2 , 

solver G {Bisection, Newton, NewtonSqr, Halley}. 

Output: proj D (x) G D where D := S { ^ h) C R| . 

// Check whether sparseness should be increased or decreased. 

1 (¥(cc), ^'(a), ¥"(a), *(a), *'(a), finished, £ h l\, d) := auxiliary (x, X h X 2 , 0); 

2 if ^(a) < then go to Line ll8i // Decrease sparseness, skip root-finding. 

// Need to increase sparseness, initialize safeguarded root-finding. 

3 lo:=0; up := max{x,- | / G {1, . . . ,«}, x,- ^max y - G{lj „}X y -}; a := lo+^(up — lo); 

4 (¥(a), »P'(a), ¥"(a), *(a), »J"(a), finished, £ b := auxiliary (x, X 2 , a); 

// Perform root-finding until correct interval has been found. 

5 while not finished do 
// Update Bisection interval, 
if 'P(a) > then lo := a else up := a; 

// One iteration of root-finding. 



if solver = Bisection then a := lo + ^up — lo); 
else// Use solvers based on derivatives, 
if solver = Newton then a := a — ; 

else if solver = NewtonSqr then a := a — ^ ; ; 



else if solver = Halley then 

<(aW"(a) , 



ft := 1 W'fni2 ; « : = max(0.5, mm(1.5, ft)); a := a 



7 
8 
9 

10 
11 
12 
13 

14 
15 

16 

17 end 

// Correct interval has been found, compute exact value for a. 
1 

18 a := - 

d 

II Compute result of the projection in-place. 

19 p :=0; 

20 for i := 1 to n do 



end 

// If OC fell out of bounds, perform normal Bisection, 
if a < lo or a > up then a := lo+^(up — lo); 

end 

// Re-evaluate auxiliary function at new position. 
(¥(01), *F'(a), ¥"(a), *(a), *'(a), finished, £ h t\, d) := auxiliary (x, X u X 2 , a); 




21 

22 



t := Xj — a; 

if t > then x; := ?; p := p + t 2 else x; := 0; 



23 end 

24 for i 

25 return x; 



24 for i := 1 to n do x,- := ^p**; 
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6 Gradient of the Projection 



We conclude our analysis of the sparseness-enforcing projection by considering its gradient: 

Lemma 13. The projection onto D can be cast as function R" () — > D in all points with unique 
projections, that is almost everywhere. Further, this function is differentiable almost everywhere. 
More precisely, let x G M" \D such that p := proj^x) is unique. With Remark|5] let a G M 

such that p = j^™*^" 0)|f • If*; 7^ OC for all j G {1, . . . ,n}, then proj D is differentiable in x. It 
is further possible to give a closed-form expression for the gradient as follows. Let the index set 
of nonzero entries in the projection be denoted by / := { i G { 1 , . . . , n} \ pi > } = { i\ , . . . , id } 
where d := |/|. Let e# G W denote the kth canonical basis vector for k G {1, . . . ,n} and let V := 
, . . . , ei d ) T G {0, 1 } dxn be the slicing matrix with respect to /, that is with x := Vx G M. d we 
have for example x T = (x,-, , . . . , x,- rf ). Write a := d \\x\\ 2 — G R>o and ft := dX\ — X^G M>o 
for short. With Lemma g] we find that a = \ (p||j - ^ii/f)- Denote by e := Ve £ {\} d the 
vector where all d entries are equal to unity, and let ^ := max (x — a-e, 0) = x — a ■ e G M> 
which implies that p = j^-V T q holds. 

Let p := such that p = V T p, then |proj D (x) = V T GV G M" xn where 



G := - -j- (^ee T + </p/5r T -A,, (g£ r + pe T )) . 

Proof. The projection is unique almost everywhere as already shown by Q. When x ( - ^ a for 
all i G {1, . . . ,«}, proj D is differentiable as composition of differentiable functions as then / is 
invariant to local changes in x. Write p := j^-q such that p = V T p, then the chain rule yields 

dp _dV T p d ( X 2 _\ d(x-a-e) dVx _ ^ T ^ d / q \ f _~<&\ y 



dx dp dq \||^|| 2 J dx dx dq \ WqW2J \ dx 



We thus only have to show that the matrix G defined here matches the matrix from the claim. 
It is easy to see that the mapping from a vector to its normalized version has a simple gradient, 

- - - - - 14 = A (* - £)• * - * - <** ~ — • 

the canonical dot product with e yields essentially their L\ norms. We hence obtain = 
e T x — ae T e = X\ a/t. Likewise, the L2 norm of q equals 



MW2 = PII2 _2oc Plli +da 2 = \\x\\l — cc f||Jc||j +^iyf) = PH2 - 2 ~ 
= i(d||^-[|*||?)+X?f = §(l + ?)=A|f. 

To compute the gradient of a, we first note that b does not depend on x but a does. It is JLa = 
2dx T -IpW^e 7 G R lxd , and hence J=^i = -L (dx 7 - ||Jc|| 1 e 7 ') G R lxd . With x = g + oc-e 
follows <ix — ||x|| j e = dq — X\ \f\e, and hence 

^ a - ±e T - -ii- ft/x r - llxll e r "l - (± 4- e r - ^-,5 T - - ^-o r F M lxrf 
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Algorithm 3: Product of the gradient of the projection onto D with an arbitrary vector. 
Input: }GR" and the following results of Algorithm |2j / = {i\ , . . . , i^} C { 1 , . . . , n}, 

d := |/|, p G M> , X[,X 2 G M>o, a :=d\\x\\l- \\x\\1 G M> and ft := dX\-X\ G M> . 

Output: z:= (|proj D (x)) • y G R". 

// Scan and slice input vector. 

1 y G {0} d ; suniy := 0; scp p v := 0; 

2 for i :=ltod do sum^ := suniy + Zij ; scp p y := scp p v + • z,-. ; gj := z tj ; 
// Compute product with gradient in sliced space. 

3 ~ z := \[\% I := z + jfr {h ■ sum? - d ■ Bcp p ,y) P; z := ~z + ^ (Xi • scp po , - ^ • sum y ~) e; 

// Unslice to yield final result. 

4 ye {0}"; for i \=\tod Aoy ij := Zfi 

5 return 



Therefore, by substitution into G and multiplying out we yield 

G = yjl - i-jf) fa - + %tf) 

= ^ fa " fee 7 + fegf - ±qf + ^ - , 

where we have used qq T ee T = q (q T e) e T = X\ \f^qe T and qq T eq T = A,i y^qq T . The claim then 
follows with t^- + tV- = - and <5 = . p. □ 

The gradient given in Lemma[T3lhas a particular simple form, as it is essentially a scaled identity 
matrix with additive combination of scaled dyadic products of simple vectors. In the situation 
where not the entire gradient but merely its product with an arbitrary vector is required, simple 
vector operations are already enough to compute the product: 

Theorem 14. Algorithm [3] computes the product of the gradient of the sparseness projection 
with an arbitrary vector in time and space linear in the problem dimensionality n. 

This claim can directly be validated using the expression from the gradient given in Lemma [T3l 

7 Experiments 

To assess the performance of the algorithm we proposed to compute sparseness-enforcing pro- 
jections, several experiments have been carried out. As the projection onto D is unique al- 
most everywhere, different approaches must compute the same result except for a null set. 
We have compared the results of the algorithm proposed by [2 | with the results of our al- 
gorithm for problem dimensionalities n G {2 2 ,...,2 26 } and for target degrees of sparseness 
a* G {0.025, 0.050, . . . , 0.950, 0.975 }. For every combination of n and a* we have sampled 
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one thousand random vectors, carried out both algorithms, and found that both algorithms pro- 
duce numerically equal results given the very same input vector. Moreover, we have numerically 
verified the gradient of the projection for the same range using the central difference quotient. 

Finally, experiments have been conducted to evaluate the choice of the solver for Algorithm [2] 
We have set the problem dimensionality to n := 1024, and then sampled one thousand random 
vectors for target sparseness degrees of o* G {0.200, 0.225, . . . , 0.950, 0.975 }. We have used 
the very same random vectors as input for all solvers, and counted the number of times the aux- 
iliary function had to be computed until the solution was found. The results of this experiment 
are depicted in Figure |2] While Bisection needs about the same number of evaluations over all 
sparseness degrees, the solvers based on the derivative of *P depend on o* in their number of 
evaluations. This is because their starting value is set to the midpoint of the initial bracket in 
Algorithm |2] and thus their distance to the root of *P naturally depends on o* . The solver that 
performs best is NewtonSqr, that is Newton's method applied to It is quite surprising that 
the methods based on derivatives perform so well, as *P' possesses several step discontinuities 
as illustrated in Figure Q] 

In the next experiment, the target sparseness degree was set to a* := 0.90 and the problem 
dimensionality n was varied in {2 2 , . . . ,2 26 }. The results are shown in Figure [3] The number of 
evaluations Bisection needs in the experiment grows about linearly in log(n). Because the ex- 
pected minimum difference of two distinct entries from a random vector gets smaller when the 
dimensionality of the random vector is increased, the expected number of function evaluations 
Bisection requires increases with problem dimensionality. In either case, the length of the inter- 
val that has to be found is always bounded from below by the machine precision such that the 
number of function evaluations with Bisection is bounded from above. The methods based on 
derivatives exhibit sublinear growth, where the solver NewtonSqr is again the best performing 
one. Note that the number of iterations it requires decreases when dimensionality is enhanced. 
This is because Hoyer's sparseness measure G is not invariant to problem dimensionality, and 
hence a sparseness of o* = 0.90 has a different notion for n = 2 26 than for n = 2 8 . 

8 Conclusion 

In this paper, we have proposed an efficient algorithm for computing sparseness-enforcing pro- 
jections with respect to Hoyer's sparseness measure a. Although the target set of the projection is 
here non-convex, methods from projections onto simplexes could be adapted in a straightforward 
way. We have rigorously proved the correctness of our proposed algorithm, and additionally we 
have yielded a simple procedure to compute its gradient. We have shown that our algorithm 
needs only little resources, and that it scales well with problem dimensionality, even for very 
high target sparseness degrees. 

Acknowledgments 

The authors would like to thank Michael Gabb for helpful discussions. This work was supported 
by Daimler AG, Germany. 



13 



12 



c 
o 

I 10 

CO 

w 8 

c 

o 

"ta 6 
c 

t 4 
2 








Bisection 
Newton 
Halley 
NewtonSqr 



0.2 0.3 0.4 0.5 0.6 0.7 0.8 

Target Degree of Sparseness 



0.9 



Figure 2: Auxiliary function evaluations needed to find the final interval with four different 
solvers. The problem dimensionality was set to n := 1024 and the target degree of 
sparseness a* was varied. While the performance of Bisection is constant over differ- 
ent values of a*, the solvers that use the derivative of the auxiliary function depend on 
the target sparseness and consistently outperform Bisection. Newton's method applied 
to *P is the best-performing solver over all choices of a* . 
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Figure 3: Same plot as in Figure |2j except for the target degree of sparseness was set to 
a* := 0.90 and the problem dimensionality was varied. The number of required 
function evaluations grows linearly with the logarithm of the problem dimensional- 
ity for Bisection, while the other solvers require a number sublinear in log(n). When 
n = 2 26 67 • 10 6 then Newton's method only requires 10 iterations in the mean, and 
Newton's method applied to *P requires only 4 iterations. 
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